10 - Deep Learning - Plain Version 2020 [ID:21053]

Welcome everybody to deep learning. Today we want to continue talking about the different losses and optimization, and go into a bit more detail on these interesting problems.

Let's talk first about the loss functions. Loss functions are used for different tasks, and different tasks call for different loss functions. The two most important tasks that we are facing are regression and classification. In classification you want to estimate a discrete variable for every input. In the two-class problem here on the left, this means you essentially want to decide whether a point belongs to the blue or the red dots, so you need to model a decision boundary. In regression the idea is that you want to model a function that explains your data. You have some input variable, say x2, and you want to predict x1 from it. To do so you compute a function that produces appropriate values of x1 for any given x2; here you can see this as a line fit.

We already talked about activation functions, the softmax as last activation, and the cross-entropy loss, and how they are combined. Obviously there is a difference between the last activation function in our network and the loss function. The last activation function is applied to the individual samples x of the batch, and it is present at training and testing time. So the last activation function becomes part of the network and remains there to produce the output, the prediction.
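The per-sample behavior of the last activation can be illustrated with a small NumPy sketch of the softmax (a minimal illustration, not the lecture's code; all numbers are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability; softmax is shift-invariant.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# A batch of m = 2 samples with 3 class scores (logits) each.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.0]])

# Applied to each sample individually: one probability vector per sample.
probs = softmax(logits)
print(probs.shape)        # (2, 3)
print(probs.sum(axis=1))  # each row sums to 1
```

Note that the softmax is applied row by row, independently per sample, exactly as described above; no information is combined across the batch at this stage.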

The last activation function generally produces a vector. The loss function, in contrast, combines all m samples and their labels, and in their combination they produce a loss that describes how good the fit is. The loss is generally a scalar value, and it is only needed during training time. Interestingly, many of these loss functions can be put into a probabilistic framework, which leads us to maximum likelihood estimation. In maximum likelihood estimation, just as a reminder, we consider everything to be probabilistic. We have a set of observations, capital X, that consists of individual observations.
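As a numeric preview of the likelihood construction developed in the following paragraphs, here is a small sketch of how a product of per-sample conditional probabilities relates to a sum of negative logs (the probabilities are made up for illustration):

```python
import numpy as np

# Made-up conditional probabilities p(y_i | x_i) for m = 4 training samples.
p = np.array([0.9, 0.7, 0.8, 0.6])

# Joint likelihood of the i.i.d. training set: a product over all m samples.
likelihood = np.prod(p)

# Negative log-likelihood: the product turns into a sum of negative logs.
nll = -np.sum(np.log(p))

# Both views agree: exponentiating the negative NLL recovers the product.
print(np.isclose(np.exp(-nll), likelihood))  # True
```

Multiplying many small probabilities quickly underflows in floating point, which is one practical reason the sum-of-logs form is preferred.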

Then we have associated labels, denoted y, which also stem from some distribution. Of course we need a conditional probability density function that describes how y and x are related. In particular, we can compute the probability of y given some observation x, which is very useful, for example, if you want to decide on a specific class.

Now we have to model this data set. The samples are drawn from some distribution, and the joint probability of the given data set can be computed as a product over the individual conditional probabilities. If the samples are independent and identically distributed, you can simply write this as a large product over the entire training data set: a product of the conditionals over all m samples. This is useful because we can determine the best parameters by maximizing this joint probability over the entire training data set, which means evaluating this large product.

This large product has a couple of problems. In particular, if it contains both high and low values, they may cancel out very quickly, so it is attractive to transform the entire problem into the logarithmic domain. Because the logarithm is a monotonic transformation, it does not change the position of the maximum. Hence we can use the log function, together with a negative sign, to flip the maximization into a minimization: instead of the likelihood function, we look at the negative log-likelihood function. Our large product then becomes a sum over all observations of the negative logarithms of the conditional probabilities.

Now we can look at a univariate Gaussian model. We are in the one-dimensional domain again, and we can model the conditional with a normal distribution, where we choose the output of our network as the expected value and β as the precision, i.e. 1/β as the variance. If we do so, we find the following formulation: the square root of β over the square root of 2π, times the exponential of minus β times the label minus the prediction squared, divided by 2. Okay, so let's go ahead and put this into our negative log-likelihood function. Remember, this is really something that you should know in the written exam. Everybody needs to know the normal distribution, and everybody needs to be able to convert this kind of univariate Gaussian distribution into
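The transcript breaks off here; the conversion it is heading towards, from the univariate Gaussian model to the L2 loss, can be sketched as follows (using ŷ for the network output as the mean and β as the precision, matching the formula above):

```latex
% Univariate Gaussian model with mean \hat{y}_i (the network output) and precision \beta:
p(y_i \mid x_i) = \sqrt{\frac{\beta}{2\pi}} \exp\!\left(-\frac{\beta\,(y_i - \hat{y}_i)^2}{2}\right)

% Negative log-likelihood of a single sample:
-\log p(y_i \mid x_i) = -\frac{1}{2}\log\frac{\beta}{2\pi} + \frac{\beta}{2}\,(y_i - \hat{y}_i)^2

% Summing over all m samples, the additive constant and the positive scale \beta/2
% do not affect the minimizer, so minimizing the NLL is minimizing the L2 loss:
\hat{\theta} = \operatorname*{arg\,min}_{\theta} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
```

In other words, under a Gaussian noise assumption, maximum likelihood estimation of the network parameters is exactly least-squares regression.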

Part of a video series
Accessible via: Open access
Duration: 00:15:11 min
Recording date: 2020-10-10
Uploaded on: 2020-10-10 12:46:20
Language: en-US

Deep Learning - Loss and Optimization Part 1

This video explains how to derive L2 Loss and Cross-Entropy Loss from statistical assumptions. Highly relevant for the oral exam!


Further Reading:
A gentle Introduction to Deep Learning
